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Abstract 

We present the results of a self-consistent, unified molecular dynamics study 
of simple model heteropolymers in the continuum with emphasis on folding, 
sequence design and the determination of the interaction parameters of the 
effective potential between the amino acids from the knowledge of the native 
states of the designed sequences. 
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A fundamental problem in molecular biology is that of protein folding. In a coarse- 
grained description, a protein may be thought of as a heteropolymer made up of amino acids, 
twenty kinds of which occur in nature. Broadly speaking, there are several issues that are of 
vital importance. One of these is the direct folding problem: given the effective interactions 
between amino acids and an amino acid sequence, what is its native state conformation or 
ground state structure? This is a crucial question because the functionality of a protein is 
controlled by its native state structure. A second issue is that of inverse-folding or the design 
problem: again given the effective interactions, what is the sequence of amino acids that 
would have as its native state conformation a desired target structure? A further requirement 
for a functionally useful protein is that the native state conformation be thermodynamically 
stable and kinetically accessible from a typical random conformation. Recent studies have 
shown the role played by a folding funnel in the energy landscape in facilitating rapid folding 
An underlying question relevant to both the direct and inverse folding problems is how 
one may determine the effective interactions between the amino acids from knowledge of 
the native state structures of several sequences, e.g. from the Protein Data Bank (PDB). 
Considerable progress has been made in addressing each of these issues computationally but 
not in an unified manner and usually within the framework of simplified lattice models . 

This letter presents a summary of the results of a comprehensive and unified study of 
all of the above issues with an off-lattice model of heteropolymers. Our study provides an 
important test of the feasibility of the implementation of various strategies in a realistic, 
albeit simplified framework. 

A conformation of a chain made up of N residues is defined by the coordinates rj , . . . , r ^ 
of beads in three-dimensional space. For a real protein, the beads may, for example, represent 
the C a atoms of the amino acids. We consider only effective two body forces between 
amino acids obtained on integrating out the degrees of freedom associated with the internal 
coordinates of each residue and the solvent. A simple choice for the interaction potential is: 

^ = 5M + i/K,) + r/((-) 12 -(-) 6 ), (1) 
where = | — | is the inter-residue distance. 

The parameter rj entering this equation controls the energy scale, whereas a determines 
the interaction length between monomers. The values of a and 77 have to be adjusted 
to fit both the complex interactions between the various groups of amino acids and the 
interactions with the solvent. Furthermore, these parameters could depend on the different 
types of aminoacids involved in the interaction. 

The energy function of the peptide bond is chosen to be 

f(x) = a(x - d ) 2 + b(x - d ) 4 , (2) 

with a and b taken to be 1 and 100 respectively. We add a quartic term to the usual quadratic 
one |J because a plain harmonic potential could induce energy localization in some specific 
modes, significantly increasing the time needed for equilibration. 

The parameter do, which represents the equilibrium distance of the nearest neighbors 
along the chain is set equal to 3.8A, the experimental value for the mean distance between 
nearest neighbor C a atoms along the chain in real proteins, as determined from the PDB. 

The Hamiltonian is given by: 
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ff = Ey + EE% 

j=l j=l j>i 



(3) 



The first term is the classical kinetic energy of the chain, where the pj's are the canonical 
variables conjugated to the r^'s. 

We have used Molecular Dynamics (MD) (entailing the integration of Newton's laws of 
motion on a computer) for simulating the kinetics of the chains. We employed an efficient 
and precise symplectic algorithm in which one varies the energy density e = E/N, which 
is related to the temperature ||. 

Our computations were carried out in three stages: 

1) On a lattice, one usually selects a compact conformation and attempts to design a 
sequence that has this structure as a thermodynamically stable ground state. Off-lattice, 
there are an infinite number of conformations almost all of which are not designable (i.e. 
there is no sequence which has the conformation as its ground state). Our first goal was to 
select a number of compact, designable conformations. We accomplished this by beginning 
with a homopolymer model (just one kind of amino acid) with overall attractive interactions 
between pairs of monomers. For the homopolymer case, we fixed the parameters r] = 40 
and a = 6.5A. Such a value for a ensures that, in practice, two monomers significantly 
interact with each other when their distance is smaller then 9 A. Such a distance threshold is 
conventionally used for the bond between amino acids and is determined by the requirement 
that the average number of C a — C a contacts for each amino acid is roughly equal to the 
respective numbers obtained with the all-atom definition of contacts ||. 

For a three dimensional homopolymer made up of 30 identical monomers, twenty com- 
pact conformations (the radii of gyration varied between 7.52 and 7.59 A) with low-lying 
energies were obtained performing MD simulations in a slow-cooling mode. The system was 
equilibrated after successive cooling on lowering the temperature each time by a factor of 
0.8. The compact conformations were chosen so that they had little structural overlap with 
each other. The distance D between two 3-dimensional conformations is given by 

^X>-*l) 2 , (4) 

JV i=i 

where one structure is translated and rotated to get a minimal D ||. Two conformations 
were assumed to be different if the D between them exceeded 1A based on the experimental 
resolution of protein structures JT]. 

The mean distance between a given conformation and the 19 others ranged between 
6.67A and 7.49A. Strikingly, the resulting structures possess secondary motifs, especially 
helices (see Figure 1). The appearence of secondary motifs is not a general phenomenon 
but is linked to the our choice of the length parameters a and d . The relation between the 
ratio a /do and the nature of the conformation of low- lying energy states will be discussed 
elsewhere flQfl . 

2) We next switched to a heteropolymer model employing four types of amino acids, two 
of which were predominantly hydrophobic and thus mutually strongly attractive (on inte- 
grating the solvent degrees of freedom) and the other two were hydrophilic. The composition 
of the sequence was constrained so that there were 6 amino acids each of the hydrophobic 
types and 9 amino acids each of the third and fourth kinds. The interaction potential had the 
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same form as before but with ten parameters characterizing the Lennard- Jones interactions 
between the four types of amino acids chosen by hand (see Table |) . 

Small variations in the Lennard- Jones length parameter (5 possible values of a equal to 
6.25, 6.5, 7.0, 7.5 and 8.0 A were now permitted unlike the homopolymer model in which 
just the 6.5 A value was employed) that were amino acid-type independent but allowed 
for the stabilization of the target native states were allowed to take into account approxi- 
mately the diversity of sizes of amino acids in nature. A simplified design procedure due to 
Shakhnovich and Gutin [|J was carried out in which the designed sequences are chosen so 
as to minimize the energy in the target conformation and entailed an optimal assignment of 
amino acid type to each monomer and independently a choice of the a parameter for each 
i — j pair with \i — j\ > 1. The design procedure was carried out for each of the twenty 
conformations deduced using the homopolymer model and was validated by detailed simu- 
lations which showed that the designed sequences do indeed have the target conformations 
as their ground states. We slowly cooled each designed sequence several times (typically 
50) from different random initial conditions. From this procedure, we confirmed that the 
target conformations are indeed the lowest energy structures. These cooling simulations 
also generate a set of alternative, higher energy, metastable conformations (2-5 for each 
sequence) that, when perturbed, "decay" to the global minimum conformation (the target 
structure). The energy landscape is modified by the design procedure and a folding funnel, 
that promotes thermodynamic stability and kinetic accessibility is created as shown for one 
of the designed sequences in Figure 2. 

3) We then set about to determine the effective parameters of the potential of interaction 
between the four kinds of amino acids using the knowledge of the test bank of sequences and 
their known native structures. The basic idea is to require that the energy of the designed 
sequences in their ground states be less than their energies in alternative conformations. 
This is simply a consistency requirement for the definition of a ground state (native state). 
Recently, we have carried out studies |TT[ of an optimization method for the determination 
of effective potential energies of interaction and have extensively tested it on lattice models. 
The method selects the optimal parameters such that the smallest energy gap (chosen among 
the set fl s of sequences {cr s }s=i,...,20 m the training set) between the energy of the sequence 
in its native conformation and the (higher in energy) minimum energy alternative one 
is as large as possible. This additionally promotes thermodynamic stability and may be 
implemented by defining a cost function F gap : 

7-1 ■ • E(T,a s )-E(r-,a s ) 

Fga P = -mm^^mtn^rn) (5) 

and by choosing the parameters values of the potential that minimize this cost function. 



Note that this is a slightly modified version of the equation in [11] - the key new feature is 



the presence of the denominator \E(T™, a s ) \ which serves to rescale the energy gap associated 
with a given sequence with respect to its ground state energy. 

In lattice models amenable to exact enumeration, all conformations other than that of 
the native state can be conveniently used as alternative or decoy conformations. In an off- 
lattice model, there are potentially an infinite number of such decoy conformations. We 
started by taking as a set of alternative trial structures, the native-like ones obtained from 
the homopolymer studies and metastable structures. We thus used 90 different decoy con- 
formations - 19 of the 20 basic structures obtained from the homopolymer model, excluding 
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the native structure itself, and 71 from the alternatives generated by the repeated cooling 
process (as described above). We proceeded to determine rough values of the potential 
parameters that minimize the cost function (eqn. 5) with respect to these set of decoy 
structures. The generic Lennard- Jones form with unspecified parameters rj and a was used 
as the trial potential function for the interactions between amino acids with one of the 77 
values fixed to its true value in order to set the energy scale. 

A key ingredient for the success of the procedure is the use of decoy conformations that 
are significant competitors to the native state in housing the given sequence. To add relevant 
conformations to the decoy set, we used the extracted parameter values to slowly cool each 
sequence about 5 times. Initially, when non-optimal values of parameters are used, the 
simulations lead to lowest energy conformations that differ from the true ones for almost all 
the sequences. We used these as additional decoy structures in order to iteratively refine the 
parameters of the potential. We iterated the procedure until it converges self-consistently, 
i.e. until a cooling simulation with extracted parameters leads to the true ground state. 
This procedure converges very nicely and yields values in excellent accord with the chosen 
potential parameters (Figure 3). The final iteration step is obtained using a decoy set of 
1631 structures (i.e. 19 of the 20 basic structures and 1612 alternative ones). 

Taken together, these steps lead to a unified and entirely self-consistent description of 
perhaps the simplest off-lattice model of heteropolymer chains and opens the way for similar 
studies of small segments of real proteins. An important feature of the study is that a simple 
known potential was used for the design studies and therefore facilitated the verification of 
the potential parameters that were subsequently determined. This luxury is not present 
for similar studies on real proteins, for which the potential energies of interactions between 
amino acids are truly unknown. 

In summary, we have carried out a comprehensive study of the principal issues involved in 
the protein folding problem using a simple off-lattice model in three dimensions. Our results 
include the observation of secondary motifs in the native state structures, the creation of a 
folding funnel in the energy landscape of designed sequences and successful tests of folding, 
design and the extraction of parameters characterizing the interaction potential between the 
amino acids. An application of these strategies to coarse-grained models of short proteins 
seems eminently feasible. 

One of us (CC) is grateful to Giovanni Fossati for helpful advice. This work was sup- 
ported by grants from NATO, NASA, The Center for Academic Computing and the Applied 
Research Laboratory at Penn State. 
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TABLE CAPTIONS 



Table Q Values of parameters 77 of the Lennard Jones interaction potential used in the heteropolymer 
model for the four types of aminoacids. 

FIGURE CAPTIONS 

Fig.l One of the compact structures obtained for the homopolymer model. Note that the structure 
exhibits a helix having a complete turn in 10-12 beads, whereas in naturally occuring proteins, helices have 
3.6 residues per turn. The designed sequence for this conformation is: 

31111444234431221222443433333 4. 
There are 65 contacts having a a value for equilibrium distance equal to 6.25 A, 41 with a equal to 6.5 A, 
12 with 7 A, 15 with 7.5 A, and 48 with 8 A. 

Fig. 2 The energy landscape of one of the designed sequences derived from the conformations obtained 
during numerous dynamical runs of slow cooling. The energy of each conformation is plotted as a function 
of its distance (see eqn. ||) from two fixed "reference" conformations. 

Fig. 3 Extracted parameters of the potential versus the true parameters as obtained in the last iteration. 
Dark circles represent the 77 parameters, whereas the light squares denote the a parameters. 
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TABLES 
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TABLE I. 
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